feat: recognize and use over sized allocations #523
base: master
Conversation
Edit: this comment is out-of-date with the current implementation, left because it's interesting.

I've figured out the rough equation to do it without a loop:

```rust
fn maximum_buckets_in(allocation_size: usize, table_layout: TableLayout) -> usize {
    // Given an equation like:
    //   z >= x * y + x
    // x can be maximized by doing:
    //   x = z / (y + 1)
    // If you squint:
    //   z is the size of the allocation
    //   y is table_layout.size
    //   x is the number of buckets
    // But there are details like x needing to be a power of 2,
    // and there are some extra bytes mixed in (a possible
    // rounding up for table_layout.align, and Group::WIDTH).
    // TODO: how do I factor in the ctrl_align?
    let z = allocation_size - Group::WIDTH;
    let y_plus_1 = table_layout.size + 1;
    prev_pow2(z / y_plus_1)
}
```

I'm not quite sure about the `ctrl_align`.

Edit: I tried to find a case where ignoring the alignment makes the computed layout larger than the allocation:

```rust
type T = (bool, ());
let table_layout = TableLayout::new::<T>();
let begin = {
    // there are never fewer than 4 buckets
    let (layout, _) = table_layout.calculate_layout_for(4).unwrap();
    layout.size()
};
use rayon::prelude::*;
(begin..=(1 << 47))
    .into_par_iter()
    .for_each(|allocation_size| {
        let buckets = maximum_buckets_in(allocation_size, table_layout);
        let (layout, _) = table_layout.calculate_layout_for(buckets).unwrap();
        let size = layout.size();
        assert!(
            size <= allocation_size,
            "failed {size} <= {allocation_size}"
        );
    });
```

I ran it for quite some time with different types.
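For intuition, here's a minimal sketch of the `prev_pow2` helper the snippet above assumes (the real helper in hashbrown may be implemented differently), plus a worked example using the x86_64-with-SSE numbers from the tables below (`Group::WIDTH` = 16, and `table_layout.size` = 1 for `HashSet<u8>`):

```rust
/// Largest power of two <= n (assumes n >= 1). Just a sketch, not
/// necessarily how hashbrown implements it.
fn prev_pow2(n: usize) -> usize {
    1 << (usize::BITS - 1 - n.leading_zeros())
}

fn main() {
    // Worked example: a 48-byte allocation for HashSet<u8> on x86_64
    // with SSE: z = 48 - 16 = 32, y + 1 = 1 + 1 = 2, and
    // prev_pow2(32 / 2) = 16 buckets -- matching the 16-bucket,
    // 48-byte row in the tables below.
    assert_eq!(prev_pow2((48 - 16) / (1 + 1)), 16);
}
```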
Consider `HashSet<u8>` on x86_64 with SSE with various bucket sizes and how many bytes the allocation ends up being:

| buckets | capacity | allocated bytes |
| ------- | -------- | --------------- |
| 4       | 3        | 36              |
| 8       | 7        | 40              |
| 16      | 14       | 48              |
| 32      | 28       | 80              |

In general, doubling the number of buckets should roughly double the number of bytes used. However, at small bucket counts for these small TableLayouts (4 -> 8, 8 -> 16), it doesn't happen. This is an edge case caused by the padding of the control bytes and the added Group::WIDTH. Taking the buckets from 4 to 16 (4x) only takes the allocated bytes from 36 to 48 (~1.3x).

This platform isn't the only one with edges. Here's aarch64 on an M1 for the same `HashSet<u8>`:

| buckets | capacity | allocated bytes |
| ------- | -------- | --------------- |
| 4       | 3        | 20              |
| 8       | 7        | 24              |
| 16      | 14       | 40              |

Notice 4 -> 8 buckets leading to only 4 more bytes (20 -> 24) instead of roughly doubling. Generalized, `buckets * table_layout.size` needs to be at least as big as `table_layout.ctrl_align`. For the cases listed above, we'd get these new minimum bucket counts:

- x86_64 with SSE: 16
- aarch64: 8

This is a niche optimization. However, it also removes a possible undefined-behavior edge case in resize operations. In addition, it may be a useful property for utilizing over-sized allocations (see rust-lang#523).
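A minimal sketch of that minimum-bucket rule, assuming hashbrown's floor of 4 buckets and the `size`/`ctrl_align` fields named above (illustration only, not the actual patch):

```rust
/// Smallest power-of-two bucket count (>= 4, hashbrown's floor) such
/// that buckets * size >= ctrl_align, per the rule above.
fn min_buckets(size: usize, ctrl_align: usize) -> usize {
    let mut buckets = 4;
    while buckets * size < ctrl_align {
        buckets *= 2;
    }
    buckets
}

fn main() {
    // HashSet<u8> (size = 1): matches the bucket counts listed above.
    assert_eq!(min_buckets(1, 16), 16); // x86_64 with SSE
    assert_eq!(min_buckets(1, 8), 8); // aarch64
}
```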
Relevant PR has merged.
I have rebased on the latest master and got the tests working.
Could you write a quick benchmark to measure the added cost of allocating a new table?
FYI, I intend to work on this again the week of August 25-29, 2025.
Allocators are allowed to return a larger memory chunk than was asked for. If the extra amount is large enough, the hash table can use the extra space. The Global allocator will not hit this path, because it won't over-size enough to matter, but custom allocators may. An example of an allocator which allocates full system pages is included in the test suite (UNIX only, because it uses `mmap`).
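The PR's test allocator uses `mmap` directly; purely as an illustration of the over-sizing behavior, here is a hedged sketch of a page-granular allocator built on the nightly `allocator_api`, delegating to `Global` instead of `mmap` (the 4096-byte page size is an assumption, and hashbrown itself may target the `allocator-api2` trait depending on features):

```rust
#![feature(allocator_api)] // nightly-only trait
use std::alloc::{AllocError, Allocator, Global, Layout};
use std::ptr::NonNull;

const PAGE: usize = 4096; // assumed page size, for illustration only

/// Rounds every request up to a whole page and reports the full rounded
/// size back to the caller -- which is what would let the hash table
/// claim the extra space.
struct PageRounding;

unsafe impl Allocator for PageRounding {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        let size = layout.size().div_ceil(PAGE) * PAGE;
        let rounded = Layout::from_size_align(size, layout.align()).map_err(|_| AllocError)?;
        // Global really allocates `size` bytes here, so returning the
        // full slice (and its length) is sound.
        Global.allocate(rounded)
    }

    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout) {
        // Recompute the same rounded layout we allocated with; rounding
        // an already page-sized layout up is a no-op, so this is
        // consistent whichever fitting layout the caller passes.
        let size = layout.size().div_ceil(PAGE) * PAGE;
        let rounded = Layout::from_size_align(size, layout.align()).unwrap();
        Global.deallocate(ptr, rounded)
    }
}
```

With an allocator like this, a table built via hashbrown's allocator-aware constructors (e.g. `with_capacity_in`) should end up with more capacity than the same request under `Global`.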
@Amanieu, I rebased my work, added a benchmark, and squashed my changes. This way anyone can check out the branch and compare to master by running:

```sh
# test this branch
cargo +nightly bench --bench with_capacity
# check out the benchmark commit, which doesn't have any feature
# work on it yet
git checkout --detach c1ddda229df96f7b8171609f3490128d8928e041
# test (pretty much master)
cargo +nightly bench --bench with_capacity
# switch back to this branch
git switch -c oversized-allocations
```

This doesn't actually exercise the oversized path, since it's just using the system allocator. So the overhead being measured here is "the code performs the oversized check, but the check doesn't result in a larger hash table." Here are the formatted side-by-side results on my macOS M1:
You can see results range from -2.4% to +15.7%. The results jibe with what I would have suspected: the overhead is a bit higher at the smaller sizes, because the cost of the oversized check is roughly the same no matter the size, while at the larger sizes the other parts dominate. Note that for the allocators I personally have in mind, these sizes are fairly realistic.
I can check more sizes if you are interested, particularly non-power-of-two sizes.
Nice! Do you think it's worth having a fast path if the returned size is exactly the requested size? At least in the case of the global allocator this would be const-folded away due to inlining.
I think so, especially as most hashbrown users in 2025 will not be using custom allocators. I added the fast path, and now on macOS you really can't tell any difference between the master branch and this one, because it falls within the noise.
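A minimal sketch of the shape of that fast path (hypothetical helper, not the PR's literal code; the real check compares the length of the slice returned by `Allocator::allocate` against the requested layout size):

```rust
/// Hypothetical stand-in showing the fast-path branch. `requested_size`
/// is what we asked the allocator for, `actual_size` is the length of
/// the NonNull<[u8]> it returned, and `planned_buckets` was computed
/// before allocating. `maximum_buckets_in` stands in for the PR's
/// function of the same name.
fn buckets_for_allocation(
    requested_size: usize,
    actual_size: usize,
    planned_buckets: usize,
    maximum_buckets_in: impl Fn(usize) -> usize,
) -> usize {
    if actual_size == requested_size {
        // Fast path: the allocator returned exactly what we asked for.
        // For Global this branch const-folds away after inlining.
        planned_buckets
    } else {
        // Slow path: the block is over-sized; recompute how many
        // buckets fit in what we actually received.
        maximum_buckets_in(actual_size)
    }
}
```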
Allocators are allowed to return a larger memory chunk than was asked for. If the extra amount is large enough, the hash table can use the extra space. The Global allocator will not hit this path, because it won't over-size enough to matter, but custom allocators may. An example of an allocator which allocates full system pages is included in the test suite (UNIX only, because it uses `mmap`).

This implements #489.

This relies on PR #524 (merged) to increase the minimum number of buckets for certain small types, which in turn constrains the domain of `maximum_buckets_in` so that the alignment can be ignored.

~~I haven't done any performance testing. Since this is on the slow path of making a new allocation, the feature should be doable without too much concern about overhead.~~ There is a benchmark now you can run yourself; you can see the numbers on my Apple M1 machine here. Since then, I added a fast-path shortcut which cuts this overhead down when the system allocator is used (or when any allocator returns exactly what was asked for).

I am definitely not an expert in Swiss tables. Feedback is very welcome, even nitpicking.